
    Crowdsourcing historical tabular data: 1961 Census of England and Wales

    This paper describes how crowdsourcing can be incorporated as an integral part of a comprehensive technical workflow to identify, extract, and validate data from large volumes of printed tabular statistics, and to transform them into operable digital datasets using current structural and descriptive standards. The recently completed digitisation project for the 1961 Census of England and Wales (commissioned by the UK's Office for National Statistics) is used to provide details on data processing, the crowdsourcing platform and tasks, crowd interaction, and validation of results. The multi-modal approach employed was very successful: given the challenging nature of the source material, it delivered far more complete and validated data than automated processes alone could have produced.
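
    One plausible validation step for crowd-transcribed census tables is sketched below. The two gates shown (agreement between independent transcribers, and the printed row total used as a checksum) are illustrative assumptions, not the project's actual pipeline; function names and figures are made up.

```python
# Hypothetical sketch: cross-validating two independent crowd transcriptions
# of a census table row, then checking the stated row total as a second gate.

def agree(entry_a, entry_b):
    """Accept a row only when two independent transcribers agree cell-by-cell."""
    return entry_a == entry_b

def row_total_ok(cells, stated_total):
    """Many printed census tables carry a row total; use it as a checksum."""
    return sum(cells) == stated_total

def validate_row(entry_a, entry_b, stated_total):
    # Both gates must pass before the row enters the final dataset;
    # disagreements would be routed back to the crowd or to an expert.
    return agree(entry_a, entry_b) and row_total_ok(entry_a, stated_total)

# Example: population counts for one district, keyed twice.
first_pass  = [1204, 987, 311]
second_pass = [1204, 987, 311]
print(validate_row(first_pass, second_pass, 2502))  # True: entries agree, sum matches
```

    Checks of this kind are what let a crowdsourced workflow deliver validated data rather than raw transcriptions.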

    Efficient and effective OCR engine training

    We present an efficient and effective approach to training OCR engines using the Aletheia document analysis system. All components required for training are seamlessly integrated into Aletheia: training data preparation, the OCR engine’s training processes themselves, text recognition, and quantitative evaluation of the trained engine. Such a comprehensive training and evaluation system, guided through a GUI, allows for iterative, incremental training to achieve the best results. The widely used Tesseract OCR engine serves as a case study to demonstrate the efficiency and effectiveness of the proposed approach. Experimental results validating the training approach are presented for two different historical datasets, representative of recent significant digitisation projects. The impact of different training strategies and training data requirements is presented in detail.
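
    The quantitative-evaluation step mentioned above typically boils down to character accuracy: one minus the edit distance between ground truth and recognised text, normalised by the ground-truth length. The sketch below is a generic illustration of that metric, not Aletheia's or Tesseract's actual evaluation code.

```python
# A minimal sketch of the quantitative-evaluation step: character accuracy
# for a trained engine, computed as 1 - (edit distance / ground-truth length).

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def char_accuracy(ground_truth, recognised):
    dist = levenshtein(ground_truth, recognised)
    return max(0.0, 1.0 - dist / max(1, len(ground_truth)))

# "ri" misread as "n" costs two edits over ten characters.
print(char_accuracy("historical", "histoncal"))  # 0.8
```

    Running this metric after each training iteration is what makes the incremental train-evaluate loop measurable.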

    Highlights of the novel dewaterability estimation test (DET) device

    Many industries that produce sludge in large quantities depend on sludge dewatering technology to reduce the corresponding water content. A key design tool for dewatering equipment is the capillary suction time (CST) test, which, although practical and easy to perform, has several scientific flaws. The standard CST test has considerable drawbacks: a lack of reliability, difficulties in obtaining results for heavy sludge types, unsuitability for long experiments (e.g. >30 min), and only two measurement points (its two electrodes). In comparison, the novel dewaterability estimation test (DET) is almost as simple as the CST but considerably more reliable, faster, more flexible, and more informative in terms of the wealth of visual measurement data collected with modern image analysis software. The standard deviations associated with repeated measurements of the same sludge are lower for the DET than for the CST test. In contrast to the CST device, capillary suction in the DET test is linear rather than radial, allowing for a straightforward interpretation of findings. The new DET device may replace the CST test in the sludge-producing industries in the future.
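
    The reliability comparison above can be made concrete with the relative standard deviation (coefficient of variation) of repeated measurements. The sketch below illustrates the calculation only; the measurement values are invented, not data from the paper.

```python
# Illustrative sketch: comparing test-to-test reliability via the relative
# standard deviation of repeated measurements (values below are made up).
from statistics import mean, stdev

def relative_std(measurements):
    """Coefficient of variation: spread normalised by the mean, in percent."""
    return 100.0 * stdev(measurements) / mean(measurements)

cst_times = [42.1, 39.5, 45.0, 40.8, 43.7]   # hypothetical CST times, s
det_times = [41.2, 41.6, 40.9, 41.4, 41.1]   # hypothetical DET readings, s

# A lower relative spread for the same sludge means a more reliable test.
print(f"CST: {relative_std(cst_times):.1f}%  DET: {relative_std(det_times):.1f}%")
```

    A lower coefficient of variation for the DET is exactly the kind of evidence the abstract summarises as "lower standard deviations for repeated measurements".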

    Semantics-enriched workflow creation and management system with an application to document image analysis and recognition

    Scientific workflow systems are an established means to model and execute experiments or processing pipelines. Nevertheless, designing workflows can be a daunting task for users due to the complexities of the systems and the sheer number of available processing nodes, each having different compatibility/applicability characteristics. This Thesis explores how concepts of the Semantic Web can be used to augment workflow systems in order to assist researchers as well as non-expert users in creating valid and effective workflows. A prototype workflow creation/management system has been developed, including components for ontology modelling, workflow composition, and workflow repositories. Semantics are incorporated as a lightweight layer, permeating all aspects of the system and workflows, including retrieval, composition, and validation. Document image analysis and recognition is used as a representative application domain to evaluate the validity of the system. A new semantic model is proposed, covering a wide range of aspects of the target domain and adjacent fields. Real-world use cases demonstrate the assistive features and the automated workflow creation. On that basis, the prototype workflow creation/management system is compared to other state-of-the-art workflow systems and it is shown how those could benefit from the semantic model. The Thesis concludes with a discussion on how a complete infrastructure based on semantics-enriched datasets, workflow systems, and sharing platforms could represent the next step in automation within document image analysis and other domains.
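
    The kind of compatibility check a semantic layer can automate is sketched below in miniature: each node declares input and output types, and a workflow is valid only if every link connects a matching pair. The node names and type labels are illustrative assumptions, not the thesis's actual ontology or API.

```python
# Toy illustration (not the thesis's actual system): validating a workflow
# by checking that each node's declared input type matches its predecessor's
# declared output type.

# Each node: (name, input type, output type), in pipeline order.
workflow = [
    ("binarise",      "GreyscaleImage", "BinaryImage"),
    ("segment_lines", "BinaryImage",    "TextLines"),
    ("recognise",     "TextLines",      "Text"),
]

def validate(pipeline):
    """Return a description of the first incompatible link, or None if valid."""
    for (name_a, _, out_a), (name_b, in_b, _) in zip(pipeline, pipeline[1:]):
        if out_a != in_b:
            return f"{name_a} -> {name_b}: {out_a} != {in_b}"
    return None

print(validate(workflow))  # None: every output feeds a compatible input
```

    A richer semantic model would reason over type hierarchies and applicability conditions rather than exact string equality, but the principle of machine-checkable validity is the same.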

    ICFHR 2018 Competition on recognition of historical Arabic scientific manuscripts - RASM2018

    This paper presents an objective comparative evaluation of page analysis and recognition methods for historical scientific manuscripts with text in the Arabic language and script. It describes the competition (modus operandi, dataset, and evaluation methodology) held in the context of ICFHR2018, presenting the results of the evaluation of six methods: three submitted and three baseline systems. The challenges for the participants included page segmentation, text line detection, and optical character recognition (OCR). Different evaluation metrics were used to gain insight into the algorithms, including new character accuracy metrics that better reflect the difficult circumstances presented by the documents. The results indicate that, despite the challenging nature of the material, useful digitisation outputs can be produced.
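
    For the segmentation and line-detection challenges, one common scoring ingredient is the intersection-over-union of predicted versus ground-truth regions. The sketch below shows that calculation for axis-aligned boxes; it is a generic illustration, and the competition's actual metrics are richer than this.

```python
# Hedged sketch: scoring a detected region against its ground-truth
# counterpart via intersection-over-union (IoU) of axis-aligned boxes.

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix = max(0, min(ax2, bx2) - max(ax1, bx1))   # overlap width
    iy = max(0, min(ay2, by2) - max(ay1, by1))   # overlap height
    inter = ix * iy
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

# A detected text-line box shifted 10 px right of its ground truth.
print(round(iou((0, 0, 100, 20), (10, 0, 110, 20)), 3))  # 0.818
```

    Thresholding such overlap scores is how detections are matched to ground truth before character-level accuracy is computed on the matched regions.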